Please Do Not Distribute

Project ideas

You have to come up with an initial project idea in 3 weeks. You can start with a disease, a biological question, a method, and others. Please identify and read a recent publication that are interesting to you, and make sure that dataset is publicly available. I expect you to prepare a 10 min presentation, explaining what this data is about, what/why you are interested in, and what you would like to accomplish.

Homework

Within this notebook, there are five problems for you to complete. These problems are written in a blockquote:

Homework Problem Example 1: Make a figure.

Dependencies

Install the main package we’ll be using in this notebook, sva. The data is prepared and contained in the library bladderbatch. R packages on CRAN can be installed with install.packages(). Bioconductor packages are installed by using BiocManager::install(). There may be challenges in installation procedures. So if basic commands don’t work, please search.

options(repos = c(CRAN = "https://cloud.r-project.org/"))
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("sva")
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
##     CRAN: https://cloud.r-project.org/
## Bioconductor version 3.20 (BiocManager 1.30.25), R 4.4.2 (2024-10-31)
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'sva'
install.packages(c("devtools", "broom"))
## 
## The downloaded binary packages are in
##  /var/folders/s2/8k6_19794tg76ppn72tt08y00000gn/T//Rtmp501sE2/downloaded_packages
BiocManager::install(c("Biobase", "bladderbatch"))
## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
##     CRAN: https://cloud.r-project.org/
## Bioconductor version 3.20 (BiocManager 1.30.25), R 4.4.2 (2024-10-31)
## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'Biobase' 'bladderbatch'
  library(devtools)
  library(Biobase)
  library(sva)
  library(bladderbatch)
  library(broom)
  library(tidyverse)
  library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose

Correlation between measured technical factors and PCs

As shown in Alter et al. and Johnson et al., PCA, factor models, or other related methods can be used to identify and correct for batch effects.

Using the Bottomly et al. data from the last week, we will compute correlation between some known variables and the top PCs. Load this data:

con = url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset.RData")
load(file=con)
close(con)
save(bottomly.eset, file="bottomly.Rdata")

load(file="bottomly.Rdata")
ls()
## [1] "bottomly.eset" "con"
edata <- as.matrix(exprs(bottomly.eset))
dim(edata)
## [1] 36536    21
edata[1:5,1:5]
##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001       369       744       287       769       348
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         0         1         0         1         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         0         1         1         5         0
sumna <- apply(edata, 1, function(x) sum(is.na(x)))
row.variances <- apply(edata, 1, function(x) var(x))
row.means <- apply(edata, 1, function(x) mean(x))
plot(row.variances, row.means, pch=19, main="Mean vs. Variance relationship")

hist(row.means,500)

edata <- edata[rowMeans(edata) > 10, ]
edata <- log2(as.matrix(edata) + 1)

Compute SVD and visualize the top 2 PCs. And using the meta data, we can color each data point accordingly:

edata <- t(scale(t(edata), scale=FALSE, center=TRUE))
svd.out <- svd(edata)

PC = data.table(svd.out$v,pData(bottomly.eset))
ggplot(PC) + geom_point(aes(x=V1, y=V2, col=as.factor(strain)))

ggplot(PC) + geom_point(aes(x=V1, y=V2, col=as.factor(lane.number)))

ggplot(PC) + geom_point(aes(x=V1, y=V2, col=as.factor(experiment.number)))

Compute correlation between a PC and each of measured variables (strain, lane.number, and experiment.number):

print(cor(pData(bottomly.eset)$experiment.number, svd.out$v[,1], method="spearman"))
## [1] -0.5489913
print(cor(pData(bottomly.eset)$experiment.number, svd.out$v[,2], method="spearman"))
## [1] 0.5104656
print(cor(pData(bottomly.eset)$lane.number, svd.out$v[,1], method="spearman"))
## [1] -0.2280568
print(cor(pData(bottomly.eset)$lane.number, svd.out$v[,2], method="spearman"))
## [1] -0.2241248
print(cor(pData(bottomly.eset)$experiment.number, svd.out$v[,1], method="spearman"))
## [1] -0.5489913
print(cor(pData(bottomly.eset)$experiment.number, svd.out$v[,2], method="spearman"))
## [1] 0.5104656
print(cor(pData(bottomly.eset)$lane.number, svd.out$v[,1], method="spearman"))
## [1] -0.2280568
print(cor(pData(bottomly.eset)$lane.number, svd.out$v[,2], method="spearman"))
## [1] -0.2241248

While this approach was once popular, it has many disadvantages. Particularly, we need a large sample size (per batch) and some of PCs may involve both batch effects and biological signals. In some cases, it’s possible that many PCs are moderately related to some technical factors.

Bladder Cancer Gene Expression

We use the microarray gene expression data on 57 bladder samples, that were processed in 5 batches. This dataset is available as a R/Bioconductor package, bladderbatch. This dataset is well known to have been confounded. Read the original study:

Gene expression in the urinary bladder: a common carcinoma in situ gene expression signature exists disregarding histopathological classification.

As this is also an ExpressionSet, the steps to extract expression data and meta data are identical to the last 2 weeks.

library(bladderbatch)
data(bladderdata)

# sample info
pheno = pData(bladderEset)
# expression data
edata = exprs(bladderEset)

dim(pheno)
## [1] 57  4
dim(edata)
## [1] 22283    57
#edata <- edata[1:10000,]
edata[1:5,1:10]
##           GSM71019.CEL GSM71020.CEL GSM71021.CEL GSM71022.CEL GSM71023.CEL
## 1007_s_at    10.115170     8.628044     8.779235     9.248569    10.256841
## 1053_at       5.345168     5.063598     5.113116     5.179410     5.181383
## 117_at        6.348024     6.663625     6.465892     6.116422     5.980457
## 121_at        8.901739     9.439977     9.540738     9.254368     8.798086
## 1255_g_at     3.967672     4.466027     4.144885     4.189338     4.078509
##           GSM71024.CEL GSM71025.CEL GSM71026.CEL GSM71028.CEL GSM71029.CEL
## 1007_s_at    10.023133     9.108034     8.735616     9.803271    10.168602
## 1053_at       5.248418     5.252312     5.220931     5.595771     5.025180
## 117_at        5.796155     6.414849     6.846798     5.841478     6.352257
## 121_at        8.002870     9.093704     9.263386     7.789240     9.834564
## 1255_g_at     3.919740     4.402590     4.173666     3.590649     4.338196

Dive into data about samples

It is important to look at phenotype data, to check out seq lanes, prep dates, and other experimental batches. Then, remove rows with NA, NaN, etc.

head(pheno) 
##              sample outcome batch cancer
## GSM71019.CEL      1  Normal     3 Normal
## GSM71020.CEL      2  Normal     2 Normal
## GSM71021.CEL      3  Normal     2 Normal
## GSM71022.CEL      4  Normal     3 Normal
## GSM71023.CEL      5  Normal     3 Normal
## GSM71024.CEL      6  Normal     3 Normal
sumna <- apply(edata, 1, function(x) sum(is.na(x)))
row.variances <- apply(edata, 1, function(x) var(x))
row.means <- apply(edata, 1, function(x) mean(x))
plot(row.variances, row.means, pch=19, main="Mean vs. Variance relationship")

edata <- edata[row.variances < 6,]
edata.log <- log2(edata)

Scale the rows and apply SVD. For exploration, we are proceeding with data data and centered-and-scaled data:

edata.scaled <- t(scale(t(edata.log), scale=TRUE, center=TRUE))
edata.centered <- t(scale(t(edata.log), scale=FALSE, center=TRUE))

svd.centered.out <- svd(edata.centered)
svd.centered.plot <- data.table(svd.centered.out$v[,1:10], pheno)

svd.scaled.out <- svd(edata.scaled)
svd.scaled.plot <- data.table(svd.scaled.out$v[,1:10], pheno)

Visualize the scatterplot, while labeling each sample with information (batch or cancer):

ggplot(svd.centered.plot) + geom_point(aes(x=V1, y=V2, col=as.factor(batch)))

ggplot(svd.centered.plot) + geom_point(aes(x=V1, y=V2, col=as.factor(cancer)))

ggplot(svd.scaled.plot) + geom_point(aes(x=V1, y=V2, col=as.factor(batch)))

ggplot(svd.scaled.plot) + geom_point(aes(x=V1, y=V2, col=as.factor(cancer)))

Homework Problem 1: Create a table to show the batch effects (refer to Figure 1 in Gilad and Mizrahi-Man, 2015). There are 5 batches (pheno$batch); how are biological variables and other variables related to study design are distributed among those 5 batches? Explain what could be a problem. Prepare this into a PDF file. Explanation: The distribution of biological and outcome variables is highly uneven across batches. This can potentially lead to batch effects, confounding, and reduced statistical power.

library(bladderbatch)
library(tidyverse)
library(grid)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following object is masked from 'package:BiocGenerics':
## 
##     combine
data(bladderdata)
pheno <- pData(bladderEset)

batch_table <- table(pheno$batch, pheno$cancer, pheno$outcome)

batch_df <- as.data.frame.table(batch_table)
colnames(batch_df) <- c("Batch", "Cancer", "Outcome", "Count")

batch_summary <- batch_df %>%
  filter(Count > 0) %>%  # Only keep combinations that exist
  arrange(Batch, Cancer, Outcome)

batch_summary
##   Batch Cancer  Outcome Count
## 1     1 Cancer     mTCC    11
## 2     2 Cancer     mTCC     1
## 3     2 Cancer sTCC-CIS    13
## 4     2 Normal   Normal     4
## 5     3 Normal   Normal     4
## 6     4 Biopsy   Biopsy     5
## 7     5 Biopsy   Biopsy     4
## 8     5 Cancer sTCC-CIS     3
## 9     5 Cancer sTCC+CIS    12
pivot_table <- batch_df %>%
  pivot_wider(
    id_cols = c("Cancer", "Outcome"),
    names_from = "Batch",
    values_from = "Count",
    values_fill = 0
  )

pivot_table
## # A tibble: 15 × 7
##    Cancer Outcome    `1`   `2`   `3`   `4`   `5`
##    <fct>  <fct>    <int> <int> <int> <int> <int>
##  1 Biopsy Biopsy       0     0     0     5     4
##  2 Cancer Biopsy       0     0     0     0     0
##  3 Normal Biopsy       0     0     0     0     0
##  4 Biopsy mTCC         0     0     0     0     0
##  5 Cancer mTCC        11     1     0     0     0
##  6 Normal mTCC         0     0     0     0     0
##  7 Biopsy Normal       0     0     0     0     0
##  8 Cancer Normal       0     0     0     0     0
##  9 Normal Normal       0     4     4     0     0
## 10 Biopsy sTCC-CIS     0     0     0     0     0
## 11 Cancer sTCC-CIS     0    13     0     0     3
## 12 Normal sTCC-CIS     0     0     0     0     0
## 13 Biopsy sTCC+CIS     0     0     0     0     0
## 14 Cancer sTCC+CIS     0     0     0     0    12
## 15 Normal sTCC+CIS     0     0     0     0     0
# Save PDF
pdf("Swiatkowska_problem1.pdf", width = 12, height = 6)  

grid.table(pivot_table)
grid.text("Batch Effects Table", x = 0.5, y = 0.95, just = "top", gp = gpar(fontsize = 14, fontface = "bold"))

dev.off()
## quartz_off_screen 
##                 2

Linear model with technical variables

Fitting a linear model with the least squares

When technical variables are known, we can add that to a well known linear model. Note that for our own convenience, we tell R to use “Normal” as a base factor:

pheno$cancer = relevel(pheno$cancer, ref="Normal")

We will fit this model on one variable, namely the first gene in gene expression data.

mod = lm(edata[1,] ~ as.factor(pheno$cancer) + as.factor(pheno$batch))
print(mod)
## 
## Call:
## lm(formula = edata[1, ] ~ as.factor(pheno$cancer) + as.factor(pheno$batch))
## 
## Coefficients:
##                   (Intercept)  as.factor(pheno$cancer)Biopsy  
##                       8.48001                        0.40212  
## as.factor(pheno$cancer)Cancer        as.factor(pheno$batch)2  
##                       1.36160                        0.33272  
##       as.factor(pheno$batch)3        as.factor(pheno$batch)4  
##                       1.43092                        0.71854  
##       as.factor(pheno$batch)5  
##                      -0.03006

You now can fit this linear model on all 22283 genes. We look at the coefficients related to the cancer:

pheno$cancer = relevel(pheno$cancer, ref="Normal")
mod = lm(t(edata) ~ as.factor(pheno$cancer) + as.factor(pheno$batch))
names(mod)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"
dim(mod$coefficients)
## [1]     7 22281
rownames(mod$coefficients)
## [1] "(Intercept)"                   "as.factor(pheno$cancer)Biopsy"
## [3] "as.factor(pheno$cancer)Cancer" "as.factor(pheno$batch)2"      
## [5] "as.factor(pheno$batch)3"       "as.factor(pheno$batch)4"      
## [7] "as.factor(pheno$batch)5"
# library "broom" clean up the outputs of LM
# now, we can use ggplot2 to plot various aspects of LM
library(broom)
mod_tidy <- tidy(mod)
ggplot(mod_tidy) + geom_histogram(aes(x=estimate), bins = 100, fill="darkorange")

# however, the previous line of code make a histogram of all coefficients.
# what we need to do is to find estimates of particular regression terms.
mod_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")
## # A tibble: 22,281 × 6
##    response  term                          estimate std.error statistic  p.value
##    <chr>     <chr>                            <dbl>     <dbl>     <dbl>    <dbl>
##  1 1007_s_at as.factor(pheno$cancer)Cancer   1.36      0.283     4.82    1.39e-5
##  2 1053_at   as.factor(pheno$cancer)Cancer   0.304     0.115     2.65    1.07e-2
##  3 117_at    as.factor(pheno$cancer)Cancer  -0.232     0.246    -0.943   3.50e-1
##  4 121_at    as.factor(pheno$cancer)Cancer  -0.814     0.236    -3.45    1.14e-3
##  5 1255_g_at as.factor(pheno$cancer)Cancer  -0.331     0.0964   -3.43    1.22e-3
##  6 1294_at   as.factor(pheno$cancer)Cancer   0.579     0.238     2.44    1.84e-2
##  7 1316_at   as.factor(pheno$cancer)Cancer  -0.608     0.132    -4.60    2.95e-5
##  8 1320_at   as.factor(pheno$cancer)Cancer  -0.242     0.131    -1.85    7.05e-2
##  9 1405_i_at as.factor(pheno$cancer)Cancer  -0.0375    0.460    -0.0815  9.35e-1
## 10 1431_at   as.factor(pheno$cancer)Cancer  -0.270     0.214    -1.26    2.12e-1
## # ℹ 22,271 more rows
ggplot(mod_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")) + geom_histogram(aes(x=estimate), bins = 100, fill="darkorange")

# how about the p-values?
ggplot(mod_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")) + geom_histogram(aes(x=p.value), bins = 100, fill="darkorange")

Explore convenient functions, filter and select. filter that allow you to choose rows based on logical statements based on a specific column. With select, you can select columns.

Empirical Bayes

Using ComBat to clean a dataset

ComBat effectively remove the unwanted variation due to the known and specified technical variables. The batch argument only expect one technical variable. You can also specify a model matrix (as shown model.matrix) which include further adjustment variables. This will return the cleaned data, in which you can apply a linear model.

library(sva)
batch = pheno$batch
combat_edata = ComBat(dat=edata, batch=pheno$batch, mod=model.matrix(~1, data=pheno), par.prior=TRUE, prior.plots=TRUE)
## Found5batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors

## Finding parametric adjustments
## Adjusting the Data

Just because I ran a certain algorithm that is designed to remove batch effects doesn’t necessarily mean that batch effects are removed. It is necessarily to check what has been returned:

class(combat_edata)
## [1] "matrix" "array"
dim(combat_edata)
## [1] 22281    57
combat_edata[1:10,1:10]
##           GSM71019.CEL GSM71020.CEL GSM71021.CEL GSM71022.CEL GSM71023.CEL
## 1007_s_at    10.086489     8.593022     8.735539     8.904940    10.279648
## 1053_at       5.519209     5.113330     5.158350     5.262211     5.265270
## 117_at        6.561745     6.432690     6.270195     6.220634     6.020381
## 121_at        8.789972     9.223910     9.316876     9.194753     8.670989
## 1255_g_at     3.899324     4.403977     4.092678     4.221117     4.060227
## 1294_at       7.776515     7.019255     7.158184     6.861669     7.922845
## 1316_at       5.649285     5.970757     5.735519     5.654295     5.419309
## 1320_at       4.804676     4.950993     4.934002     5.054114     4.781217
## 1405_i_at     4.610667     5.134103     5.253515     4.949733     5.716855
## 1431_at       3.611855     3.838787     4.032772     3.929808     3.671817
##           GSM71024.CEL GSM71025.CEL GSM71026.CEL GSM71028.CEL GSM71029.CEL
## 1007_s_at     9.961004     9.045476     8.694422     9.899767    10.045201
## 1053_at       5.369203     5.284901     5.256371     5.510407     5.078401
## 117_at        5.748936     6.228249     6.583219     5.959973     6.176811
## 121_at        7.758164     8.904424     9.060979     8.070845     9.587972
## 1255_g_at     3.829740     4.342484     4.120578     3.705901     4.280065
## 1294_at       7.981089     7.380566     7.191836     7.400875     7.738215
## 1316_at       4.974139     5.673043     6.034637     5.157917     5.495240
## 1320_at       4.565796     5.050048     4.983493     4.667656     4.648607
## 1405_i_at     5.670019     5.446907     4.823131     5.894286     4.899287
## 1431_at       3.630528     4.016914     4.053489     3.672133     3.910500
## compare heatmaps before vs. after
library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(RColorBrewer)
my_palette <- colorRampPalette(c("blue", "white", "darkred"))(n = 299)

edata_sub <- edata[,]

png("bladder.png",height=700,width=700)
heatmap.2(edata,
          main = "Bladder Cancer Data Clustered", # heat map title
          notecol="black",      # change font color of cell labels to black
          density.info="none",  # turns off density plot inside color legend
          trace="none",         # turns off trace lines inside the heat map
          margins =c(12,9),     # widens margins around plot
          col=my_palette,       # use on color palette defined earlier 
          dendrogram="none",     # only draw a row dendrogram
          scale = "row")
dev.off()
## quartz_off_screen 
##                 2
png("bladder_combat.png",height=700,width=700)
heatmap.2(combat_edata,
          main = "Bladder Cancer Data Cleaned by ComBat", # heat map title
          notecol="black",      # change font color of cell labels to black
          density.info="none",  # turns off density plot inside color legend
          trace="none",         # turns off trace lines inside the heat map
          margins =c(12,9),     # widens margins around plot
          col=my_palette,       # use on color palette defined earlier 
          dendrogram="none",     # only draw a row dendrogram
          scale = "row")
dev.off()
## quartz_off_screen 
##                 2

Evaluate if the cleaned data from ComBat has no relation to batch effects:

svd.out.combat <- svd(combat_edata)
svd.combat.plot <- data.table(svd.out.combat$v[,1:10], pheno)

ggplot(svd.combat.plot) + geom_point(aes(x=V1, y=V2, col=as.factor(batch)))

Homework Problem 2: Make heatmaps, BEFORE and AFTER cleaning the data using ComBat, where columns are arranged according to the study design. You must sort the columns such that 5 batches are shown. Cluster the rows, but do not cluster the columns (samples) when drawing a heatmap. The general idea is that you want to see if the Combat-cleaned data are any improvement in the general patterns.

library(bladderbatch)
library(sva)
library(gplots)
library(RColorBrewer)

data(bladderdata)
pheno <- pData(bladderEset)
edata <- exprs(bladderEset)

combat_edata <- ComBat(dat = edata, batch = pheno$batch, mod = model.matrix(~1, data = pheno), par.prior = TRUE, prior.plots = FALSE)
## Found5batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
# sort columns by batch
batch_order <- order(pheno$batch)
edata_sorted <- edata[, batch_order]
combat_edata_sorted <- combat_edata[, batch_order]

my_palette <- colorRampPalette(c("blue", "white", "red"))(n = 299)

# heatmap for the UNCORRECTED data
png("heatmap_uncorrected.png", width = 700, height = 700)
heatmap.2(
  edata_sorted,
  Colv = FALSE,  # Do not cluster columns
  Rowv = TRUE,   # Cluster rows
  col = my_palette,
  scale = "row",  # Scale by row (genes)
  main = "Heatmap of UNCORRECTED Data",
  trace = "none", # Remove trace lines
  dendrogram = "none",
  labCol = pheno$batch[batch_order], # Label columns by batch
  margins = c(12, 9) # Widens margins around plot
)
dev.off()
## quartz_off_screen 
##                 2
# heatmap for the COMBAT-CORRECTED data
png("heatmap_combat_corrected.png", width = 700, height = 700)
heatmap.2(
  combat_edata_sorted,
  Colv = FALSE,  # Do not cluster columns
  Rowv = TRUE,   # Cluster rows
  col = my_palette,
  scale = "row",  # Scale by row (genes)
  main = "Heatmap of ComBat-CORRECTED Data",
  trace = "none", # Remove trace lines
  dendrogram = "none",
  labCol = pheno$batch[batch_order], # Label columns by batch
  margins = c(12, 9) # Widens margins around plot
)
dev.off()
## quartz_off_screen 
##                 2

Homework Problem 3: Make heatmaps of Pearson correlations statistics of samples. For example, see Figure 2 and 3 freom Gilad and Mizrahi-Man (2015) F1000Research: . First, compute the correlation statistics among columns. Second, create a heatmap using heatmap.2(). Make sure to create or add labels for samples (cancer vs. normal; batch numbers; others)

# Pearson correlation
cor_uncorrected <- cor(edata, method="pearson")
cor_combat <- cor(combat_edata, method="pearson")

samples_labels <- paste("Batch", pheno$batch, "-", pheno$cancer)

png("heatmap_correlation_uncorrected.png", width = 700, height = 700)
heatmap.2(
  cor_uncorrected,
  Colv = TRUE,  # Cluster columns
  Rowv = TRUE,  # Cluster rows
  col = my_palette,
  scale = "none",  # No scaling (correlation values are already normalized)
  main = "Heatmap (UNCORRECTED)",
  trace = "none", # Remove trace lines
  dendrogram = "both", # Show for rows and columns
  labRow = samples_labels, # Label rows with batch and cancer status
  labCol = samples_labels, # Label columns with batch and cancer status
  margins = c(12, 9) # Widens margins around plot
)
dev.off()
## quartz_off_screen 
##                 2
# Create a heatmap for the COMBAT-CORRECTED correlation matrix
png("heatmap_correlation_combat_corrected.png", width = 700, height = 700)
heatmap.2(
  cor_combat,
  Colv = TRUE,  # Cluster columns
  Rowv = TRUE,  # Cluster rows
  col = my_palette,
  scale = "none",  # No scaling (correlation values are already normalized)
  main = "Heatmap (ComBat-CORRECTED)",
  trace = "none", # Remove trace lines
  dendrogram = "both", # Show for rows and columns
  labRow = samples_labels, # Label rows with batch and cancer status
  labCol = samples_labels, # Label columns with batch and cancer status
  margins = c(12, 9) # Widens margins around plot
)
dev.off()
## quartz_off_screen 
##                 2

Now we can fit the linear model with a cleaned data.

modcombat = lm(t(combat_edata) ~ as.factor(pheno$cancer))

# library "broom" clean up the outputs of LM
# now, we can use ggplot2 to plot various aspects of LM
library(broom)
modcombat_tidy <- tidy(modcombat)

# histogram of estimates of particular regression terms.
ggplot(modcombat_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")) + geom_histogram(aes(x=estimate), bins = 100, fill="darkorange")

# different way of looking at coefficients from many different models
# the vertical line indicates the zero coefficient.
ggplot(modcombat_tidy, aes(estimate, term)) +
     geom_point() +
     geom_vline(xintercept = 0)

Compare the empirical Bayes estimates to the conventional linear models. In the scatter plot below, the red line indicates the identity. The blue line indicates the linear line that has been fitted into the actual estimates from two approaches. What we observe is that the estimates from the ComBat-cleaned data are shrunken towards 0 compared to the estimates from the previous linear models. This phenomenon is called shrinkage or regularization, which plays a critical role in high-dimensional data analysis.

lm_genes <- mod_tidy %>% 
  filter(term == "as.factor(pheno$cancer)Cancer") %>% 
  pull(response)

combat_genes <- modcombat_tidy %>% 
  filter(term == "as.factor(pheno$cancer)Cancer") %>% 
  pull(response)

# Find common genes between both models
common_genes <- intersect(lm_genes, combat_genes)

# Create a tibble with only the common genes
est_compare <- tibble(
  LinearModel = mod_tidy %>% 
    filter(term == "as.factor(pheno$cancer)Cancer", response %in% common_genes) %>% 
    arrange(response) %>%  # Ensure same order
    pull(estimate),
)
  
  ComBat = modcombat_tidy %>% 
    filter(term == "as.factor(pheno$cancer)Cancer", response %in% common_genes) %>% 
    arrange(response) %>%  # Ensure same order
    pull(estimate)

ggplot(est_compare, aes(x=LinearModel, y=ComBat)) +
     geom_point(col="darkgrey", alpha=.5, size=.5) + geom_abline(intercept=0, slope=1, col="darkred") + geom_smooth(method = "lm", se = TRUE)  + theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Let’s look at the p-values. The majority of variables are still very significant, although it’s less so that p-values from the least square method.

ggplot(modcombat_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")) + geom_histogram(aes(x=p.value), bins = 100, fill="darkorange")

Homework Problem 4: Apply two different Linear Models to the Bottomly et al. data. First, using a conventional approach, create a linear model with a genetic strain (biological variable) and an experimental number (technical variable) on uncorrected gene expression data.

Second, create a linear model with a genetic strain (biological variables) on corrected gene expression data from ComBat. Make a scatter plots of coefficients and a histogram of p-values as done in this notebook. Make sure that you are pulling out the correct coefficients, not any or all coefficients.

library(sva)
library(broom)
library(ggplot2)
library(data.table)

edata <- as.matrix(exprs(bottomly.eset))
pheno <- pData(bottomly.eset)

combat_edata <- ComBat(dat = edata, batch = pheno$experiment.number, mod = model.matrix(~1, data = pheno), par.prior = TRUE, prior.plots = FALSE)
## Found 24229 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
## Found3batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
mod_uncorrected <- lm(t(edata) ~ as.factor(pheno$strain) + as.factor(pheno$experiment.number))
mod_uncorrected_tidy <- tidy(mod_uncorrected)

mod_combat <- lm(t(combat_edata) ~ as.factor(pheno$strain))
mod_combat_tidy <- tidy(mod_combat)

unique(mod_combat_tidy$term)
## [1] "(Intercept)"                   "as.factor(pheno$strain)DBA/2J"
unique(mod_uncorrected_tidy$term)
## [1] "(Intercept)"                         "as.factor(pheno$strain)DBA/2J"      
## [3] "as.factor(pheno$experiment.number)6" "as.factor(pheno$experiment.number)7"
# p-values
ggplot(mod_uncorrected_tidy %>% filter(term == "as.factor(pheno$strain)DBA/2J"), aes(x = p.value)) +
  geom_histogram(bins = 100, fill = "darkorange") +
  theme_bw() +
  labs(title = "P-values for Strain (Uncorrected Data)")
## Warning: Removed 22604 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(mod_combat_tidy %>% filter(term == "as.factor(pheno$strain)DBA/2J"), aes(x = p.value)) +
  geom_histogram(bins = 100, fill = "darkorange") +
  theme_bw() +
  labs(title = "P-values for Strain (ComBat-Corrected Data)")
## Warning: Removed 22604 rows containing non-finite outside the scale range
## (`stat_bin()`).

est_compare <- tibble(
  Uncorrected = mod_uncorrected_tidy %>% filter(term == "as.factor(pheno$strain)DBA/2J") %>% pull(estimate),
  ComBat = mod_combat_tidy %>% filter(term == "as.factor(pheno$strain)DBA/2J") %>% pull(estimate)
)

# Scatter plot of coefficients
ggplot(est_compare, aes(x = Uncorrected, y = ComBat)) +
  geom_point(col = "darkgrey", alpha = 0.5, size = 0.5) +
  geom_abline(intercept = 0, slope = 1, col = "darkred") +
  geom_smooth(method = "lm", se = TRUE) +
  theme_bw() +
  labs(title = "Scatter Plot of Coefficients (Uncorrected vs. ComBat)",
       x = "Uncorrected Coefficients",
       y = "ComBat Coefficients")
## `geom_smooth()` using formula = 'y ~ x'

Surrogate Variable Analysis (SVA)

Finding a dimension of surrogate variables

When the technical variables are not known or there are additional dependence across the noise term, SVA can be used to estimate and correct for a dependence kernel.

The hyper parameter required for SVA is the number of surrogate variables. This is very challenging with numerous methods available. The package SVA provides two methods to estimate the number of surrogate variables, n.sv.

library(bladderbatch)
data(bladderdata)
edata <- exprs(bladderEset)
pheno <- pData(bladderEset)
edata <- as.matrix(edata)
mod = model.matrix(~as.factor(cancer), data=pheno)

set.seed(1)
rnorm(1)
## [1] -0.6264538
# permutation procedure from Buja and Eyuboglu 1992
num.sv(edata,mod,method="be")
## [1] 9
# asymptotic approach from Leek 2011 Biometrics.
num.sv(edata,mod,method="leek")
## [1] 2

We will go with the Leek 2011 method, e.g., method="leek".

Estimating surrogate variables (SVs)

We fit SVA without specifying any known technical variables. Essentially, we are hoping that SVA can recover the batch effects (including 5 batches that we know).

mod = model.matrix(~as.factor(cancer),data=pheno)
mod0 = model.matrix(~1, data=pheno)
sva_output = sva(edata, mod, mod0, n.sv=num.sv(edata,mod,method="leek"))
## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5

Once SVs are estimated, we proceed to check how they may be related to the known technical variables. See the LM output:

head(sva_output$sv)
##              [,1]         [,2]
## [1,] -0.027168731  0.003886997
## [2,] -0.006180066  0.198882692
## [3,]  0.077370249  0.197930599
## [4,] -0.001137525  0.059063230
## [5,] -0.020617855 -0.044496006
## [6,]  0.085287640 -0.191534921
# summary shows how the batches are related to SV1 and SV2 separately.
# which SV have more information about pheno$batch? 
summary(lm(sva_output$sv ~ pheno$batch))
## Response Y1 :
## 
## Call:
## lm(formula = Y1 ~ pheno$batch)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26953 -0.11076  0.00787  0.10399  0.19069 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.018471   0.038694  -0.477    0.635
## pheno$batch  0.006051   0.011253   0.538    0.593
## 
## Residual standard error: 0.1345 on 55 degrees of freedom
## Multiple R-squared:  0.00523,    Adjusted R-squared:  -0.01286 
## F-statistic: 0.2892 on 1 and 55 DF,  p-value: 0.5929
## 
## 
## Response Y2 :
## 
## Call:
## lm(formula = Y2 ~ pheno$batch)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23973 -0.07467 -0.02157  0.08117  0.25629 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.121112   0.034157   3.546 0.000808 ***
## pheno$batch -0.039675   0.009933  -3.994 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1187 on 55 degrees of freedom
## Multiple R-squared:  0.2248, Adjusted R-squared:  0.2108 
## F-statistic: 15.95 on 1 and 55 DF,  p-value: 0.0001944

Visualizing and exploring SVs

Now, perhaps that SV2 and SV3 are strongly related to the batch effect (i.e. technical variable).

Lets make the scatter plot using SV1 and SV2. The data points are colored by their pheno data.

sva_batch <- tibble(SV1=sva_output$sv[,1],
                  SV2=sva_output$sv[,2],
                  batch=as.factor(pheno$batch),
                  cancer=as.factor(pheno$cancer),
                  outcome=as.factor(pheno$outcome))

ggplot(sva_batch) + geom_point(aes(x=SV1,y=SV2, col=batch))

ggplot(sva_batch) + geom_point(aes(x=SV1,y=SV2, col=cancer))

ggplot(sva_batch) + geom_point(aes(x=SV1,y=SV2, col=outcome))

We further make the violin plots of values of SVs, stratified by the five batches. If the values of SVs are separately (differentially distributed) among batches, that may be an evidence that SVA are capturing the known technical variables.

sva_batch <- tibble(SV1=sva_output$sv[,1],
                  SV2=sva_output$sv[,2],
                  batch=as.factor(pheno$batch))
sva_batch_gather <- gather(sva_batch,"sv","value",-batch)

ggplot(sva_batch_gather) + geom_violin(aes(x=batch,y=value)) + facet_wrap(~ sv, ncol = 1)

ggplot(sva_batch_gather) + geom_violin(aes(x=batch,y=value)) + facet_wrap(~ sv, ncol = 1) + geom_jitter(aes(x=batch,y=value,col=batch))

It seems that 2 surrogate variables (SVs) contain substantial information about a known technical variable. Therefore, we proceed to fit the model.

Note that the following code to visualize estimates are rather complex and long. We are using filter to choose rows (cancer factors) and select to choose estimates (coefficients).

Fitting a LM with surrogate variables

# Add the surrogate variables to the model matrix
modsva = lm(t(edata) ~ as.factor(pheno$cancer) + cbind(sva_output$sv))
modsva_tidy <- tidy(modsva)

# Get the gene identifiers from all three result sets
lm_genes <- mod_tidy %>% 
  filter(term == "as.factor(pheno$cancer)Cancer") %>% 
  pull(response)

combat_genes <- modcombat_tidy %>% 
  filter(term == "as.factor(pheno$cancer)Cancer") %>% 
  pull(response)

sva_genes <- modsva_tidy %>% 
  filter(term == "as.factor(pheno$cancer)Cancer") %>% 
  pull(response)

# Find common genes across all three models
common_genes <- Reduce(intersect, list(lm_genes, combat_genes, sva_genes))

est_compare <- tibble(
  LinearModel = mod_tidy %>% 
    filter(term == "as.factor(pheno$cancer)Cancer", response %in% common_genes) %>% 
    arrange(response) %>%  # Ensure same order
    pull(estimate),
  
  ComBat = modcombat_tidy %>% 
    filter(term == "as.factor(pheno$cancer)Cancer", response %in% common_genes) %>% 
    arrange(response) %>%  # Ensure same order
    pull(estimate),
  
  SVA = modsva_tidy %>% 
    filter(term == "as.factor(pheno$cancer)Cancer", response %in% common_genes) %>% 
    arrange(response) %>%  # Ensure same order
    pull(estimate)
)

ggplot(est_compare, aes(x=LinearModel, y=SVA)) +
  geom_point(col="darkgrey", alpha=.5, size=.5) + 
  geom_abline(intercept=0, slope=1, col="darkred") + 
  geom_smooth(method = "lm", se = TRUE) + 
  theme_bw() +
  labs(
    title = "Comparison of Linear Model vs. SVA Coefficients",
    x = "Linear Model Estimates",
    y = "SVA Estimates"
  )
## `geom_smooth()` using formula = 'y ~ x'

ggplot(est_compare, aes(x=ComBat, y=SVA)) +
  geom_point(col="darkgrey", alpha=.5, size=.5) + 
  geom_abline(intercept=0, slope=1, col="darkred") + 
  geom_smooth(method = "lm", se = TRUE) + 
  theme_bw() +
  labs(
    title = "Comparison of ComBat vs. SVA Coefficients",
    x = "ComBat Estimates",
    y = "SVA Estimates"
  )
## `geom_smooth()` using formula = 'y ~ x'

At last, let’s look at the p-values from SVA.

ggplot(modsva_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer")) + geom_histogram(aes(x=p.value), bins = 100, fill="darkorange")

It seems that even though the estimates are shrunken towards to zero and the surrogate variables have approximated the technical variables well, the p-values as a whole may not changed so much.

#unique_terms_mod <- unique(mod_tidy$term)
#unique_terms_combat <- unique(modcombat_tidy$term)
#unique_terms_sva <- unique(modsva_tidy$term)

# how many rows match the filter in each dataset
#linear_rows <- sum(mod_tidy$term == "as.factor(pheno$cancer)Cancer")
#combat_rows <- sum(modcombat_tidy$term == "as.factor(pheno$cancer)Cancer")
#sva_rows <- sum(modsva_tidy$term == "as.factor(pheno$cancer)Cancer")

#print(paste("LinearModel matching rows:", linear_rows))
#print(paste("ComBat matching rows:", combat_rows))
#print(paste("SVA matching rows:", sva_rows))

pvalues_linear <- tibble(key = "LinearModel", 
                         value = mod_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer") %>% pull(p.value))
pvalues_combat <- tibble(key = "ComBat", 
                         value = modcombat_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer") %>% pull(p.value))
pvalues_sva <- tibble(key = "SVA", 
                      value = modsva_tidy %>% filter(term == "as.factor(pheno$cancer)Cancer") %>% pull(p.value))

# Combine them with bind_rows
pvalues_gather <- bind_rows(pvalues_linear, pvalues_combat, pvalues_sva)

# Plot
ggplot(pvalues_gather, aes(x=value)) + geom_histogram() + facet_wrap(~key)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# pi0 from the original data ~ 0.26
# pi0 from a combat-cleaned data ~ 0.28
# pi0 from SVA ~ 0.27

Homework Problem 5: Apply ComBat and SVA to the Bottomly et al. data. Make a scatter plots of coefficients and a histogram of p-values, comparing results based on ComBat and SVA. Assume that the biological variables in Bottomly et al data is the genetic strains. Make sure that you are pulling out the correct coefficients/pvalues, not any or all of them.

library(sva)
library(broom)
library(ggplot2)
library(tidyr)

edata <- as.matrix(exprs(bottomly.eset))
pheno <- pData(bottomly.eset)

# Filter low expressed genes
edata <- edata[rowMeans(edata) > 10, ]
edata <- log2(edata + 1)

# Remove any rows with NA values
na_rows <- apply(edata, 1, function(x) any(is.na(x)))
edata <- edata[!na_rows, ]

# ComBat correction
mod_for_combat <- model.matrix(~1, data=pheno)
combat_edata <- ComBat(dat=edata, batch=pheno$experiment.number, mod=mod_for_combat, par.prior=TRUE, prior.plots=FALSE)
## Found3batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
# model matrices for SVA
mod <- model.matrix(~as.factor(pheno$strain), data=pheno)
mod0 <- model.matrix(~1, data=pheno)

sva_output <- sva(edata, mod, mod0, n.sv=num.sv(edata, mod, method="leek"))
## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5
# Linear models
mod_combat <- lm(t(combat_edata) ~ as.factor(pheno$strain))
mod_combat_tidy <- tidy(mod_combat)

mod_sva <- lm(t(edata) ~ as.factor(pheno$strain) + sva_output$sv)
mod_sva_tidy <- tidy(mod_sva)

# coefficients and p-values
combat_coef <- mod_combat_tidy %>% 
  filter(term == "as.factor(pheno$strain)DBA/2J") %>% 
  pull(estimate)

combat_pvals <- mod_combat_tidy %>% 
  filter(term == "as.factor(pheno$strain)DBA/2J") %>% 
  pull(p.value)

sva_coef <- mod_sva_tidy %>% 
  filter(term == "as.factor(pheno$strain)DBA/2J") %>% 
  pull(estimate)

sva_pvals <- mod_sva_tidy %>% 
  filter(term == "as.factor(pheno$strain)DBA/2J") %>% 
  pull(p.value)

# data frames for plotting
coef_compare <- data.frame(ComBat = combat_coef, SVA = sva_coef)
coef_compare <- na.omit(coef_compare) # na.omit() handle potential NAs in results after model fitting

combat_pvals_df <- data.frame(p.value = combat_pvals)
sva_pvals_df <- data.frame(p.value = sva_pvals)

# Plot coefficients
ggplot(coef_compare, aes(x = ComBat, y = SVA)) +
  geom_point(col = "darkgrey", alpha = 0.5, size = 0.5) +
  geom_abline(intercept = 0, slope = 1, col = "darkred") +
  geom_smooth(method = "lm", se = TRUE) +
  theme_bw() +
  labs(title = "Coefficients: ComBat vs. SVA",
       x = "ComBat Coefficients",
       y = "SVA Coefficients")
## `geom_smooth()` using formula = 'y ~ x'

# Plot p-values
ggplot(combat_pvals_df, aes(x = p.value, fill = Method)) +
  geom_histogram(aes(x=p.value), bins = 100, fill="darkorange") +
  theme_bw() +
  labs(title = "P-values Distribution Combat")

ggplot(sva_pvals_df, aes(x = p.value, fill = Method)) +
  geom_histogram(aes(x=p.value), bins = 100, fill="darkorange")+
  theme_bw() +
  labs(title = "P-values Distribution SVA")